Sampling clustering based on multi-view attribute structural relations

In light of the exponential growth in information volume, the significance of graph data has intensified. Graph clustering plays a pivotal role in graph data processing by jointly modeling the graph structure and node attributes. Notably, the practical significance of multi-view graph clustering is heightened due to the presence of diverse relationships within real-world graph data. Nonetheless, prevailing graph clustering techniques, predominantly grounded in deep learning neural networks, face challenges in effectively handling multi-view graph data. These challenges include the incapability to concurrently explore the relationships between multiple view structures and node attributes, as well as difficulties in processing multi-view graph data with varying features. To tackle these issues, this research proposes a straightforward yet effective multi-view graph clustering approach known as SLMGC. This approach uses graph filtering to filter noise, reduces computational complexity by extracting samples based on node importance, enhances clustering representations through graph contrastive regularization, and achieves the final clustering outcomes using a self-training clustering algorithm. Notably, unlike neural network algorithms, this approach avoids the need for intricate parameter settings. Comprehensive experiments validate the supremacy of the SLMGC approach in multi-view graph clustering endeavors when contrasted with prevailing deep neural network techniques.


Introduction
Graph clustering involves partitioning a graph into several disjoint clusters of nodes [1].Multi-view clustering, building upon graph clustering, leverages richer graph information by seeking consistent clustering results through multiple view relationships.Graph data techniques find widespread applications in various practical situations, such as group segmentation [2], social graphs [3], sentiment analysis [4][5][6] and the traffic classification [7,8].
Multi-view graph clustering has evolved from single-view clustering, with LINE [9] and GAE [10] being two representative algorithms in this context.LINE aims to map nodes in a graph to a low-dimensional space, making adjacent nodes closer in the embedding space.It achieves this by maximizing the similarity of positive samples and minimizing the similarity of negative samples.GAE employs the idea of autoencoders to learn node embeddings while preserving the structural information of the graph.The encoder part of the autoencoder maps nodes to a low-dimensional space, and the decoder attempts to reconstruct the original graph from this low-dimensional representation.However, real-world graph data often involves more relationships, and single-view clustering may not leverage these additional connections to explore deeper information.
Early multi-view clustering techniques can be broadly categorized into two types.One approach involves obtaining a consensus graph from multiple views and applying a singleview algorithm to it, such as RMSC, PwMC, and SwMC [11][12][13].Another approach utilizes graph embedding to obtain compact representations of nodes from multi-view data and then applies classical clustering algorithms, such as PMNE, mvn2vec, and SMNE [14][15][16].However, these algorithms fail to simultaneously leverage both node attributes and graph relationships.
In recent years, inspired by Graph Convolutional Networks (GCN) [17], two types of multi-view attribute graph clustering algorithms have emerged.One2Multi (O2MA) [18] posits that there exists one view containing the most information among multiple views.Therefore, O2MA employs a graph autoencoder based on one view to embed nodes and reconstruct multiple views.However, this approach fails to fully leverage the structural relationships between different views.MAGCN [19] primarily deals with the clustering of multiple node attribute graphs under a single structural relationship.Clearly, they are not effective in handling graph data with multiple node attributes and multiple view structures.
The graph learning module can effectively address the issue of simultaneously leveraging node attributes and structural relationships.MAGC [20], MCGC [21], and HMvC [22] employ graph learning modules to address the computationally complex nature of neural network parameters.However, their computational efficiency still needs improvement for large datasets, and the use of traditional clustering methods during final clustering results in insufficient stability.
Existing multi-view clustering algorithms have the following shortcomings: 1. Real-world graph data often contains noise or missing values.Deep neural networks heavily rely on the quality of raw graph data, and they lack interpretability.
2. Using neural network methods for multi-view clustering with large samples is computationally complex, time-consuming, and memory-intensive.
3. In the final clustering stage, many existing multi-view clustering methods utilize traditional clustering methods such as k-means.However, these clustering methods exhibit high randomness, leading to significant variations in results with each computation.
In an effort to overcome the mentioned constraints, we introduce a novel approach for multi-view clustering, referred to as SLMGC.The complete structure of SLMGC is illustrated in Fig 1 .In this paper, we provide a comprehensive algorithmic explanation of the clustering algorithm and conduct a thorough analysis of its individual modules.The primary contributions of this investigation can be outlined as follows: 1. Graph filtering is employed as a replacement for Graph Convolutional Neural Networks (GCN) to obtain node embeddings from the feature matrix.This approach mitigates the impact of noise in the initial data on the eventual clustering outcomes.

2.
A sampling algorithm is utilized to select a batch of nodes, which effectively reduces computational costs and diminishes the impact of outliers on the clustering outcomes.

3.
A contrastive loss is used as a regularization term, enabling the utilization of both structural information and features from various views to construct the consensus graph.Moreover, a self-training clustering algorithm is designed to enhance the stability of the final clustering results and reduce clustering bias commonly observed in traditional approaches.

Relevant concepts and definitions
Multi-view data refers to data composed of multiple relationships represented in multiple views, where each node in the graph corresponds to a sample point.The edges between node pairs in each view represent the relationships between those nodes in that specific view.The relationships in each view are represented by adjacency matrices, and each node is associated with its own attributes or features, represented by vectors.The feature vectors of all nodes collectively form the feature matrix for that particular view.

Multi-view graph data
Let G ¼ fV; E 1 ; . . .; E V ; X 1 ; . . .; X V g denote the multi-view data, where the set of N nodes is denoted as V, e ij 2 E v indicates the presence of an edge among node i and node j in the v-th view and belongs to the set E v :X v ¼ fx v 1 ; . . .; x v N g T 2 R N�d v represents the feature matrix of the v-th view, comprising N attribute vectors of length d v .

Laplacian matrix
The relationship structure of each view, i.e., the presence of edges, can be represented using the adjacency matrix If there exists an edge between node i and node j, e a v lj ¼ 1, otherwise, e a v lj ¼ 0. D v is the degree matrix of the adjacency matrix f A v .Considering the self-expressive property of nodes in the graph, where a node can be linearly represented by its neighboring nodes, we introduce the normalized adjacency matrix denoted 2 , its associated graph Laplacian matrix represented as L v = I − A v , where I is the identity matrix.

Graph filtering
In real-world graph data, neighboring nodes often exhibit certain similarities in their features.Therefore, we introduce the concept of graph filtering [23].To facilitate the subsequent discussions, we first focus on the single-view scenario.The feature matrix X 2 R N×d can be viewed as a collection of N graph signal vectors, denoted as f.The Laplacian matrix can be decomposed as L = UΛU −1 , where Λ = diag(λ 1 , � � �, λ N ) represents the increasing eigenvalues, and U = [u 1 , � � �, u N ] corresponds to the related orthogonal eigenvectors.The graph filter can be expressed as G f = Up(Λ)U −1 2 R N×N , where p(Λ) = diag(p(λ 1 ), � � �, p(λ N )) denotes the frequency-response function [24].The graph filtering operation can be defined as the multiplication between the graph signal and the graph filter: The filtered graph signal is denoted as � f .To facilitate clustering, we desire nodes within the same cluster to possess similar feature values across all dimensions.Based on this assumption about clusters, we consider that nodes closer in distance are more likely to belong to the same cluster.However, directly applying graph filters to the feature matrix may not fully exploit the graph's structural information, as first-order graph filters only smooth the neighboring nodes within one hop.Therefore, we consider the use of k-th order graph filtering to capture longer-distance graph structural information.We specify the k-th order graph filtering as follows: Where � X represents the feature matrix after filtering.

Graph learning
Considering that real-world graph data often contain noise and missing values, directly applying spectral clustering to � X may not yield satisfactory clustering results.Utilizing the selfexpressive characteristic of graph data [25], where each node can be expressed as a linear combination of other nodes, we learn a similarity graph Z from � X.The coefficients of node combinations represent the relationships between nodes, which can also be viewed as distances in classical clustering tasks.The objective function for single-view is as follows: Where α > 0is a weight parameter, Z 2 R N×N is the consensus graph matrix.The primary component represents the reconstruction loss, and the secondary component represents the regularization term.
To handle multi-view data, we apply graph filtering to each view's feature matrix, resulting in � X v .Subsequently, we extend Eq 3 by introducing a weight parameter for each view to determine their respective importance.Ultimately, we obtain the consensus graph for all views as follows: λ v represents the coefficient value for the v-th view, and ω < 0 is the smoothing factor.

Node sampling
In the case of large datasets with a considerable number of nodes, direct spectral clustering on the obtained consensus graph Z could lead to long computation times and high memory usage.Additionally, considering the impact of outliers may result in a decrease in clustering accuracy.To address these issues, we refrain from using the previously filtered � X and instead opt to extract m(m < N) key sample points that hold significance within the graph [26].
Classic sampling algorithms assume equal weights for each point, but in graph data, different nodes hold varying levels of importance.Inspired by word sampling techniques in NLP [27], we perform sampling based on node importance, where nodes with a higher number of edges in each view are considered to be more important.We define qðiÞ ¼ Ãv ij as the function for measuring importance.The likelihood of each node i being the first sample in the sampling set M is given by: Where γ > 0. Subsequently, we employ a non-replacement sampling algorithm to select the remaining m-1 samples.Specifically, each remaining node i is selected as the next sample with a probability of p i /S j= 2M p j , until m nodes have been sampled.Now, we construct the feature matrix B = {b 1 , � � �, b m } T 2 R m×d using the sampled node feature vectors, and B is a part of � X.Ultimately, we acquire a reduced consensus graph matrix S 2 R m×N through the learning process, which represents the similarity between the m sampled nodes and all N nodes.As a result, we can reformulate Eq 4 as follows:

Graph contrastive regularization
Contrastive learning has gained popularity in unsupervised tasks.The fundamental concept of contrastive learning is to optimize the similarity between positive pairs while increasing the distance between negative pairs.In this study, each node and its K-nearest neighbors (KNN) are considered as positive pairs, and the contrastive regularization term is applied in Eq 6 for learning, resulting in the final consensus graph S. It can be represented as follows: x Where K v i denotes the K-nearest neighbors of node i in the v-th view.This regularization term is designed to enhance the similarity among positive pairs and diminish the similarity among negative pairs.Ultimately, our algorithm can be represented as follows:

Self-training clustering
Existing multi-view graph clustering algorithms mostly directly apply k-means or spectral clustering to the obtained consensus graph to acquire the final clustering labels.However, these algorithms suffer from significant randomness.Inspired by the DEC algorithm [28], we adopt self-training clustering to obtain the final labels, significantly improving the clustering stability.
We first define the normalized consensus graph Ŝ ¼ W À 1=2 S, and matrix W is the diagonal matrix composed of the sums of each row in S. We conduct Singular Value Decomposition (SVD) on the matrix ŜŜ T to acquire the largest p singular values and their corresponding left and right singular vectors, denoted as YSL T .Here, S = diag(σ 1 , � � �, σ p ) represents the singular values, while Y 2 R m×p are the left singular vectors and L 2 R m×p are the right singular vectors, respectively.We compute the final clustering matrix C ¼ S À 1 2 Y T Ŝ, and then perform self-training clustering on C = C T [29].
We improve clustering iteratively by matching the target distribution through soft clustering.The clustering loss function is defined as follows: Where KL(�k�) denotes the Kullback-Leibler divergence, Q represents the soft clustering labels, P represents the target distribution, and q ij is the metric based on Student's t-distribution [30].
It measures the resemblance among cluster C i and cluster center μ i and can be understood as the likelihood of allocating sample i to cluster j: In Eq 9, the target distribution P is obtained by squaring the term q and then normalizing it.It is defined as: Finally, our clustering labels are given by: Where q ij is calculated using Eq 10.If the change in labels of the target distribution between consecutive updates is less than the threshold δ, the training is terminated.We obtain the clustering results based on the previous iteration's Q.

Optimization
In Eq 8, there are two sets of variables S and λ v .We utilize an alternating optimization approach, we keep one variable fixed while updating the other.
• Fix λ v , update S Treating λ v as a constant, the optimization objective for S is as follows: We employ the gradient descent algorithm to solve for S, and the gradient of S can be decomposed into two parts.The first part is given by: 2 Let n be the total sum of the number of neighbors for all nodes, then the second term is: Then, we utilize the Adam optimization algorithm [31] to update S. To improve convergence speed, We get an initial S + by Eq 6.
• Fix S, update λ v The loss function for λ v is given by: Setting its derivative to 0, we can obtain: We alternately update S and λ v until convergence.The entire process is outlined in Algorithm 1.
For the obtained similarity graph Z using Eq 4, we can directly perform spectral clustering to achieve the ultimate clustering result.Nevertheless, this algorithm has a time complexity of O(N 3 ) and significant memory overhead, which is not suitable for scenarios with large datasets.Instead, we utilize node sampling algorithm to obtain a smaller similarity graph S using Eq 8 with a time complexity of only O(m 3 ).Regarding the time cost of the gradient descent is O (tVmn + 2tVm), while generating the final clustering matrix Y and E is O(m 3 ) and O(m 2 N) respectively, and the time cost of self-training clustering is O(tm 2 ).In summary, our algorithm has higher efficiency compared to traditional graph learning algorithms.
Algorithm 1: SLMGC The graph filtering order k, and the parameters α, ω, γ, as well as the number of clusters p.

Result: Clustering label 1 compute the normalized adjacency matrix
2 ; 2 compute the Laplacian matrix L v = I − A v ; 3 perform graph filtering on each view's feature matrix according to Eq 2; 4 extract m samples and represent their indices as "ind."; 5 select m rows from the graph-filtered � X using the indices "ind" to construct a new feature matrix B; 6 while not converged do 7 Use the Adam optimization algorithm to update S; 8 for each view do 9 update λ v using Eq 17. calculate the soft clustering labels Q and the target distribution P; 16 calculate the loss using Eq 9. 17 end 18 return Q; 19 Obtain the final clustering labels using Eq 12.

Dataset and evaluation metrics
We choose five datasets to assess our experiments.Among these datasets, ACM, DBLP, and IMDB consist of a feature matrix and several adjacency matrices.Amazon Photo and Amazon Computer [32] consist of multiple feature matrices and one adjacency matrix.The statistical information of the dataset statistics are presented in Table 1.
• ACM: It is a paper network.We construct two views based on the co-paper (papers authored by the same authors) relationship and co-subject (papers with the same subjects) relationship.The paper features are represented as bag-of-words elements composed of keywords.We utilize the academic disciplines or subject areas of the papers as clustering labels; • DBLP: It is an author network.It consists of three types of relationships: co-authorship (authors who have jointly authored papers), co-conference (authors who have published papers in the identical conferences), and co-term (authors who have utilized identical terminologies in their respective papers).The author features are represented as bag-of-words elements composed of keywords.We utilize the academic disciplines or subject areas of the authors as clustering labels; • IMDB: This is a movie network.It employs two connections, co-actor (movies with common actors) and co-director (movies directed by common directors), to build a dual-view representation.The movie features are represented as bag-of-words elements composed of plots.We use movie genres as clustering label; • Amazon Photos and Amazon Computers: They are portions of the Amazon co-purchase network dataset, where nodes correspond to products, and each product's features are represented as bag-of-words from product comments.The relationships in the views are based on co-purchasing of products, and the clustering labels are product categories.To acquire multi-view attributes, the second feature matrix is formed by taking the Cartesian product.
We employ four commonly used metrics to showcase the effectiveness of our approach: Accuracy (ACC), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI) and F1 score (F1).

Experimental setup and comparison models
The computer configuration used in this experiment is as follows: CPU is AMD Ryzen 5 4600H with Radeon Graphics, 6 cores and 12 threads, operating at 3.00GHz.The memory is 16GB at 3200MHz, and the GPU is NVIDIA GeForce GTX 1650 with 4GB of memory.
To validate our effectiveness, we compare SLMGC with several typical models.Among them, LINE and GAE are two traditional single-view clustering algorithm, and we average their results across each view to obtain the final metrics.PMNE is a multi-view clustering algorithm.RMSC is a robust multi-view spectral clustering algorithm utilizing Markov chains.PwMC and SwMC introduce weighted mechanisms for clustering multi-view data.O2MAC and O2MA are multi-view attribute graph clustering algorithms utilizing graph autoencoders.MAGCN is a multi-view attribute graph convolutional network, MAGC and MCGC are multi-view clustering algorithms that utilize graph learning modules.

Experiment result
Tables 2 and 3 present the clustering results.In most measurements, our algorithm outperforms the reference algorithms on the ACM, DBLP, IMDB, Amazon Photo, and Amazon Computer datasets.

Run time comparison
Next, we compared the runtime of SLMGC with two top-performing deep neural network algorithms, O2MAC and MAGCN, as well as graph learning algorithms MAGC and MCGC on five datasets.The specific performance is presented in Table 4.

Ablation study
In this part, we conducted a series of experiments to investigate the impact of each parameter in the algorithm, comprising the order of graph filtering k, the number of sampled points m and its parameter γ for sample extraction, as well as the parameters α and ω in graph contrastive learning.
In theory, the higher the order of graph filtering, the better it can capture global information.Therefore, we set k=[0, 2, 5, 10, 100] and performed t-SNE visualization on node features under different filter orders on the DBLP dataset, as shown in Fig 2 .It can be observed that graph filtering indeed processes the node features effectively.However, a higher filter order does not necessarily lead to better results; excessively high filter orders can make the node features too similar, making it difficult to distinguish between them.Through experiments, we have found that a filter order of 2 is a preferable choice.
In order to assess the efficacy of graph filtering, we performed comparative experiments on three datasets: ACM, DBLP, and IMDB.We compared the results obtained without using graph filtering and with using 2nd-order graph filtering.It can be observed that graph filtering indeed improves the final clustering results, demonstrating the effectiveness of graph filtering.
The results are presented in Table 5.
In the previous sections, we proposed the theory of sample extraction, which can reduce computation time, save memory overhead, and avoid the influence of outliers on the final clustering.Next, we conduct experiments and analysis on the sample extraction size in the ACM, DBLP, and IMDB datasets.We start by sampling every 500 nodes downwards from the total number of original nodes.It is observed that too few sample points can result in insufficient information, affecting the final clustering metrics.Through experiments, we found that sample extraction size around 2/3 of the total number of nodes yields the best performance.See Fig 3.
The parameter γ has a very small impact on sample extraction, as shown in Fig 4 .In our experiments, we use a slightly superior value of 4 for γ.
Finally, we conduct experiments on the two parameters in graph contrastive learning.We set α= [1,5,10,100,1000] and ω=[-1,-2,-3,-4,-5], and observe their effects on the four evaluation metrics on the DBLP dataset.As shown in Fig 5, our algorithm performs less favorably under higher α values, but it is not sensitive to lower α values and ω.This demonstrates the robustness of our algorithm within a certain range of these parameters, making it applicable and meaningful in practical scenarios.

Discussion
Through experiments on the ACM, DBLP, and IMDB datasets, we found that O2MA, MAGC, MCGC, and our SLMGC algorithm outperform single-view algorithms GAE and LINE.This is because single-view algorithms cannot leverage multiple-view relationships, demonstrating the superiority of multi-view clustering.However, early multi-view clustering methods such as PMNE, RMSC, PwMC, and SwMC show poor performance as they cannot simultaneously utilize node attributes and structural relationships.
It can be observed that the three algorithms using graph learning modules perform better than several methods using deep neural networks.This is attributed to the fact that graph learning modules can simultaneously leverage node attributes and structural relationships  while fully considering information from all views.In the case of our method within the graph learning module algorithm, it does not perform as well as MAGC and MCGC on the IMDB dataset.This is because our node sampling method, in datasets with numerous nodes, discards a significant number of nodes, leading to a loss of accuracy.
On the Amazon Photo and Amazon Computer datasets, our SLMGC algorithm exhibits a significant advantage compared to the deep neural network algorithm MAGCN.This is because our algorithm considers not only various node attributes but also the structural relationships in the graph.However, our algorithm has a considerable disadvantage compared to the other two graph learning module algorithms on the Amazon Computer dataset.This is because, in samples with multiple node attributes, discarding more nodes results in greater loss.
Although our algorithm sacrifices some accuracy through sampling, it significantly outperforms in terms of computational time.The comparison of runtime indicates that graph

Conclusions
Existing graph clustering algorithms heavily rely on deep learning networks, and multi-view clustering is still in its early stage with many unresolved issues.This paper proposes a sampling-based graph learning multi-view clustering algorithm.We introduce graph filtering to reduce noise in the original graphs, followed by extracting a subset of node samples based on their importance.This approach reduces computational cost while maintaining clustering accuracy.Moreover, we incorporate a graph contrastive regularization term to enhance the graph learning module.Finally, we employ self-training clustering to reduce potential errors in traditional clustering algorithms.We compare our algorithm with popular and well-performing deep learning graph clustering algorithms on five datasets, and the experimental results demonstrate the superiority of our proposed approach.